Skip to content

fix(node): gracefully clean up iota-node validator components #6831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 27, 2025

Conversation

semenov-vladyslav
Copy link
Contributor

Description of change

After a validator node leaves the committee, a few resources in validator components are leaked including metrics and grpc server. If the node will try to re-join the committee it will crash due to resources already being in use.

This PR:

  • changes the way validator metrics are registered: a dedicated registry is used instead of default one;
  • adds a graceful shutdown of validator grpc server and metrics registry clean up during reconfiguration when a validator leaves the committee;
  • adds a reproducer test.

Notes

There are a few ways to handle metrics issue:

  • ignore AlreadyReg error while trying to re-register metrics with the default registry; requires special unwrap handling for each metric;
  • unregister individual metrics with the default registry; the existing components do not provide such functionality; requires unregistering all individual metrics;
  • use a dedicated registry and remove it entirely from the registry service once node is not validator anymore; this seems to be the easiest and cleanest way.

⚠️ There may be other undetected resources leaking that do not cause error if validator components are re-created.

Links to any relevant issues

Fixes #6277.

Type of change

  • Bug fix (a non-breaking change which fixes an issue)

How the change has been tested

scripts/simtest/cargo-simtest simtest --package iota-e2e-tests --test "reconfiguration_tests" --profile ci -- test_reconfig_with_same_validator

Change checklist

  • I have followed the contribution guidelines for this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that new and existing unit tests pass locally with my changes

Release Notes

  • Protocol:
  • Nodes (Validators and Full nodes): Fixed a bug causing iota-node to crash after re-joining the consensus committee as validator.
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@semenov-vladyslav semenov-vladyslav requested review from a team as code owners May 9, 2025 09:34
Copy link

vercel bot commented May 9, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

4 Skipped Deployments
Name Status Preview Comments Updated (UTC)
apps-backend ⬜️ Ignored (Inspect) Visit Preview May 27, 2025 4:26pm
apps-ui-kit ⬜️ Ignored (Inspect) Visit Preview May 27, 2025 4:26pm
rebased-explorer ⬜️ Ignored (Inspect) Visit Preview May 27, 2025 4:26pm
wallet-dashboard ⬜️ Ignored (Inspect) Visit Preview May 27, 2025 4:26pm

@iota-ci iota-ci added core-protocol node Issues related to the Core Node team labels May 9, 2025
@alexsporn alexsporn added this to the v1.2.x - protocol v8 milestone May 19, 2025
@muXxer muXxer force-pushed the node/fix-6277-revalidator-crash branch from 3d14434 to f7d54b1 Compare May 22, 2025 08:28
Copy link
Contributor

@piotrm50 piotrm50 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@bingyanglin bingyanglin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@muXxer muXxer force-pushed the node/fix-6277-revalidator-crash branch 3 times, most recently from 872ae68 to f7141f3 Compare May 26, 2025 14:22
@muXxer muXxer force-pushed the node/fix-6277-revalidator-crash branch from ea7d282 to 220f7f2 Compare May 27, 2025 16:15
@semenov-vladyslav semenov-vladyslav merged commit 38bc3fb into develop May 27, 2025
36 checks passed
@semenov-vladyslav semenov-vladyslav deleted the node/fix-6277-revalidator-crash branch May 27, 2025 17:47
semenov-vladyslav added a commit that referenced this pull request May 27, 2025
# Description of change

After a validator node leaves the committee, a few resources in
validator components are leaked including metrics and grpc server. If
the node will try to re-join the committee it will crash due to
resources already being in use.

This PR:
- changes the way validator metrics are registered: a dedicated registry
is used instead of default one;
- adds a graceful shutdown of validator grpc server and metrics registry
clean up during reconfiguration when a validator leaves the committee;
- adds a reproducer test.

## Notes

There are a few ways to handle metrics issue:
- ignore `AlreadyReg` error while trying to re-register metrics with the
default registry; requires special unwrap handling for each metric;
- unregister individual metrics with the default registry; the existing
components do not provide such functionality; requires unregistering all
individual metrics;
- use a dedicated registry and remove it entirely from the registry
service once node is not validator anymore; this seems to be the easiest
and cleanest way.

:warning: There may be other undetected resources leaking that do not
cause error if validator components are re-created.

## Links to any relevant issues

Fixes #6277.

## Type of change

- Bug fix (a non-breaking change which fixes an issue)

## How the change has been tested

```sh
scripts/simtest/cargo-simtest simtest --package iota-e2e-tests --test "reconfiguration_tests" --profile ci -- test_reconfig_with_same_validator
```

## Change checklist

- [x] I have followed the contribution guidelines for this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have checked that new and existing unit tests pass locally with
my changes

### Release Notes

- [ ] Protocol:
- [x] Nodes (Validators and Full nodes): Fixed a bug causing iota-node
to crash after re-joining the consensus committee as validator.
- [ ] Indexer:
- [ ] JSON-RPC:
- [ ] GraphQL:
- [ ] CLI:
- [ ] Rust SDK:
- [ ] REST API:

---------

Co-authored-by: muXxer <[email protected]>
alexsporn pushed a commit that referenced this pull request May 28, 2025
…onents (#6831) (#7093)

# Description of change

After a validator node leaves the committee, a few resources in
validator components are leaked including metrics and grpc server. If
the node will try to re-join the committee it will crash due to
resources already being in use.

This PR:
- changes the way validator metrics are registered: a dedicated registry
is used instead of default one;
- adds a graceful shutdown of validator grpc server and metrics registry
clean up during reconfiguration when a validator leaves the committee;
- adds a reproducer test.

## Notes

There are a few ways to handle metrics issue:
- ignore `AlreadyReg` error while trying to re-register metrics with the
default registry; requires special unwrap handling for each metric;
- unregister individual metrics with the default registry; the existing
components do not provide such functionality; requires unregistering all
individual metrics;
- use a dedicated registry and remove it entirely from the registry
service once node is not validator anymore; this seems to be the easiest
and cleanest way.

:warning: There may be other undetected resources leaking that do not
cause error if validator components are re-created.

## Links to any relevant issues

Fixes #6277.

## Type of change

- Bug fix (a non-breaking change which fixes an issue)

## How the change has been tested

```sh
scripts/simtest/cargo-simtest simtest --package iota-e2e-tests --test "reconfiguration_tests" --profile ci -- test_reconfig_with_same_validator
```

## Change checklist

- [x] I have followed the contribution guidelines for this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have checked that new and existing unit tests pass locally with
my changes

### Release Notes

- [ ] Protocol:
- [x] Nodes (Validators and Full nodes): Fixed a bug causing iota-node
to crash after re-joining the consensus committee as validator.
- [ ] Indexer:
- [ ] JSON-RPC:
- [ ] GraphQL:
- [ ] CLI:
- [ ] Rust SDK:
- [ ] REST API:

Co-authored-by: muXxer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-protocol node Issues related to the Core Node team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validator node crashes when it was a validator node before, and becomes a validator/committee member again
10 participants